Journal of Proteome Research — Latest Matching Preprints

1

One-Pot Quantitation of Mycobacterial N-terminal Protein Acetylation Peptidoforms and Proteome

Hu, D. D.; Weaver, S. D.; Jones, B. S.; Champion, P. A.; Champion, M. M.

2025-03-31 biochemistry 10.1101/2025.03.30.646241 medRxiv

Top 0.1%

84.6%

Isotopic labeling of proteins for quantitative proteomics is a popular technique to increase sample throughput and provide improved accuracy and precision for relative or absolute quantitation between samples. Derivatives of this technique are used to label protein and peptide N-termini for selective enrichment and analysis. We previously reported on a method to enrich and quantify protein N-terminal acetylation in model-system and pathogenic mycobacteria. Significant recent advancements in silica filter-based protein digestion have improved identification of proteins in bottom-up proteomics. However, these are not yet compatible with existing methods which detect and quantify protein termini and N-terminal modifications. Here, we present a one-pot method (OnePotNTA) that incorporates silica filter digestion with protein N-terminal labeling and subsequent quantitation. This technique achieves high-density coverage of the N-terminome and obviates the need to enrich N-terminal peptides prior to analysis. OnePotNTA identified 54% of the canonical proteome from whole-cell lysates of Mycobacterium marinum and eliminated biases in peptides identified by forgoing enrichment. This reduced sample preparation time by ≥5-fold and preserved protein-level abundance measurements (LFQ) from the same injections and analysis. Analysis of a mutant strain of M. marinum lacking Emp1 (Δemp1), an N-terminal acetyltransferase required for efficient pathogenesis, identified 37 putative substrates of this enzyme. Additionally, analysis of the remaining peptides identified at least 34 proteins with alternate true N-termini, distinct from the canonical genome.

2

MARLOWE: Taxonomic Characterization of Unknown Samples for Forensics Using De Novo Peptide Identification

Jenson, S. C.; Chu, F.; Alves, G.; Ogurtsov, A. Y.; Barente, A. S.; Crockett, D. L.; Lamar, N. C.; Merkley, E. D.; Yu, Y.-K.; Jarman, K. H.

2025-06-02 bioinformatics 10.1101/2024.09.30.615220 medRxiv

Top 0.1%

81.3%

We present a computational tool, MARLOWE, for source organism characterization of unknown, forensic biological samples. The intent of MARLOWE is to address a gap in applying proteomics data analysis to forensic applications. MARLOWE produces a list of potential source organisms given confident peptide tags derived from de novo peptide sequencing and a statistical approach to assign peptides to organisms in a probabilistic manner, based on a broad sequence database. In this way, the algorithm assumes no a priori knowledge of potential sources, and the probabilistic way peptides are taxonomically assigned and then scored enables results to be unbiased (within the constraints of the sequence database). In a proof-of-concept study, we examined MARLOWE's performance on two datasets, the Biodiversity dataset and the Bacillus cereus superspecies dataset. Not only did MARLOWE demonstrate successful characterization to true contributors in single source and binary mixtures in the Biodiversity dataset, but it also provided sufficient specificity to distinguish species within a bacterial superspecies group. We also compared MARLOWE's results to those of MiCId, a leading microbial identification/characterization tool based on proteomics database search. Comparison of the two tools using 225 mass spectrometry data files yielded comparable performance, with slightly higher accuracy and specificity for MiCId. At the species level, MARLOWE achieved a specificity of 91.4% at 5% FDR. These results suggest that MARLOWE is suitable for candidate- or lead-generation identification of single-organism and binary samples that can generate forensic leads and aid in selecting appropriate follow-on analyses in a forensic context.

3

Joint Protein Inference Analysis with PyProteinInference Elucidates Biological Understanding of Tandem Mass Spectrometry Data

Hinkle, T. B.; Bakalarski, C. E.

2024-10-20 bioinformatics 10.1101/2024.10.17.618892 medRxiv

Top 0.1%

77.3%

Selection and application of protein inference algorithms can have a significant impact on the data output from tandem mass spectrometry (MS/MS) experiments, yet its use is often an afterthought in proteomics research due to the inability to apply different inference algorithms in existing analysis systems today. PyProteinInference provides a comprehensive suite of tools to guide researchers through the application of multiple inference algorithms and computation of protein-level, set-based false discovery rates (FDR) from tandem mass spectrometry (MS/MS) data using a unified interface. Here, we describe the software and its application to a K562 whole-cell lysate as well as in a CRAF affinity-purification mass spectrometry experiment to demonstrate its utility in facilitating conclusions about underlying biological mechanisms in proteomic data.

4

Mining Mass Spectra for Peptide Facts

Lemieux, S.; Zumer, J.

2023-11-01 bioinformatics 10.1101/2023.10.27.564468 medRxiv

Top 0.1%

76.2%

The current mainstream software for peptide-centric tandem mass spectrometry data analysis can be categorized as either database-driven, which relies on a library of mass spectra to identify the peptide associated with novel query spectra, or de novo sequencing-based, which aims to find the entire peptide sequence by relying only on the query mass spectrum. While the first paradigm currently produces state-of-the-art results in peptide identification tasks, it does not inherently make use of information present in the query mass spectrum itself to refine identifications. Meanwhile, de novo approaches attempt to solve a complex problem in one go, without any search space constraints in the general case, leading to comparatively poor results. In this paper, we decompose the de novo problem into putatively easier subproblems, and we show that peptide identification rates of database-driven methods may be improved by solving one such subproblem without requiring a solution for the complete de novo task. We demonstrate this using a de novo peptide length prediction task as the chosen subproblem. As a first prototype, we show that a deep learning-based length prediction model increases peptide identification rates in the ProteomeTools dataset as part of a Pepid-based identification pipeline. Using the predicted information to better rank the candidates, we show that combining ideas from the two paradigms produces clear benefits in this setting. We propose that the next generation of peptide-centric tandem mass spectrometry identification methods should combine elements of these paradigms by mining facts "de novo" about the peptide represented in a spectrum, while simultaneously limiting the search space with a peptide candidates database.

5

Universal toolset for mass spectrometric analysis of intracellular peptidome and small protein fraction

Kote, S.; Pirog, A.; Faktor, J.; Dziadosz, A.; Marek-Trzonkowska, N.

2024-12-12 cancer biology 10.1101/2024.12.06.627199 medRxiv

Top 0.1%

75.8%

The analysis of the native intracellular peptidome has gained significant attention in recent years. However, there is still a need for more knowledge regarding various sample preparation methods that facilitate efficient and reproducible recovery of peptides, which can then be analyzed using quantitative liquid chromatography-mass spectrometry. A similar situation exists in small proteome research, typically defined as polypeptides with lengths of less than 100 amino acids, often too long for easy identification without enzymatic digestion. In this context, we describe a set of methods that involve simple denaturation and solid-phase extraction of polypeptides, applicable for isolating short intracellular polypeptides within the desired length range. Our work demonstrates the efficiency and reproducibility of these methods for quantitative analysis of the peptidome in mammalian cells. Additionally, we investigated the flexibility of adjusting the mass range through ultrafiltration. We have shown that these methods can be adapted for highly efficient enrichment and fractionation of small proteins, resulting in polypeptide isolates suitable for tryptic digestion and intact protein analysis. Moreover, we describe the use of freely available computational tools that can effectively manage the analysis of the resulting data. The research presented here will benefit the global scientific community in both fundamental (protein turnover, proteolytic processing, non-canonical open reading frames, etc.) and applied sciences (bioactive/neuro peptide discovery, precision medicine, vaccines, etc.), and other areas that could benefit from selective analysis of short native polypeptides.

6

Proteogenomics analysis of human tissues using pangenomes

Wang, D.; Bouwmeester, R.; Zheng, P.; Dai, C.; Puente, A. S.; Shu, K.; Bai, M.; Umer, H. M.; Perez-Riverol, Y.

2024-05-28 bioinformatics 10.1101/2024.05.24.595489 medRxiv

Top 0.1%

74.2%

The genomics landscape is evolving with the emergence of pangenomes, challenging the conventional single-reference genome model. The new human pangenome reference provides an extra dimension by incorporating variations observed in different human populations. However, the increasing use of pangenomes in human reference databases poses challenges for proteomics, which currently relies on UniProt canonical/isoform-based reference proteomics. Including more variant information in human proteomes, such as small and long open reading frames and pseudogenes, prompts the development of complex proteogenomics pipelines for analysis and validation. This study explores the advantages of pangenomes, particularly the human reference pangenome, on proteomics, and large-scale proteogenomics studies. We reanalyze two large human tissue datasets using the quantms workflow to identify novel peptides and variant proteins from the pangenome samples. Using three search engines (SAGE, COMET, and MSGF+) followed by Percolator, we analyzed 91,833,481 MS/MS spectra from more than 30 normal human tissues. We developed a robust deep-learning framework to validate the novel peptides based on DeepLC, MS2PIP and pyspectrumAI. The results yielded 170,142 novel peptide spectrum matches, 4,991 novel peptide sequences, and 3,921 single amino acid variants, corresponding to 2,367 genes across five population groups, demonstrating the effectiveness of our proteogenomics approach using the recent pangenome references.

7

Unipept in 2024: Expanding metaproteomics analysis with support for missed cleavages, semi-tryptic and non-tryptic peptides

Vande Moortele, T.; Devlaminck, B.; Van de Vyver, S.; Van Den Bossche, T.; Martens, L.; Dawyndt, P.; Mesuere, B.; Verschaffelt, P.

2024-11-27 bioinformatics 10.1101/2024.09.26.615136 medRxiv

Top 0.1%

73.8%

Unipept, a pioneering software tool in metaproteomics, has significantly advanced the analysis of complex ecosystems by facilitating both taxonomic and functional insights from environmental samples. From the onset, Unipept's capabilities focused on tryptic peptides, utilizing the predictability and consistency of trypsin digestion to efficiently construct a protein reference database. However, the evolving landscape of proteomics and emerging fields like immunopeptidomics necessitate a more versatile approach that extends beyond the analysis of tryptic peptides. In this article, we present a significant update to the underlying index structure of Unipept, which is now powered by a Sparse Suffix Array index. This advancement enables the analysis of semi-tryptic peptides, peptides with missed cleavages, and non-tryptic peptides such as those encountered in fields like immunopeptidomics (e.g. MHC- and HLA-peptides). This new index benefits all tools in the Unipept ecosystem such as the web application, desktop tool, API and command line interface. A benchmark study highlights significantly improved performance in handling missed cleavages, preserving the same level of accuracy.
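
To make the indexing idea concrete, here is a minimal sketch of peptide lookup with a plain (full) suffix array over a concatenated protein database. It is not Unipept's implementation: the abstract names a Sparse Suffix Array, which stores only a subset of suffixes, and gives no further details. The function names, the `$` separator and the toy sequences are all assumptions made for illustration.

```python
import bisect


def build_index(proteins):
    """Build a full suffix array over '$'-joined protein sequences.

    proteins: list of (protein_id, sequence) pairs (hypothetical toy input).
    """
    ids = [pid for pid, _ in proteins]
    starts, pos = [], 0
    for _, seq in proteins:
        starts.append(pos)
        pos += len(seq) + 1  # +1 accounts for the '$' separator
    text = "$".join(seq for _, seq in proteins)
    # Naive construction; fine for a toy example, far too slow for real databases.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, starts, ids, sa


def proteins_containing(index, peptide):
    """Return the sorted ids of proteins that contain `peptide` as a substring."""
    if not peptide:
        raise ValueError("peptide must be non-empty")
    text, starts, ids, sa = index
    m = len(peptide)

    def bound(upper):
        # Binary search over suffixes, comparing only their first m characters.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < peptide or (upper and prefix == peptide):
                lo = mid + 1
            else:
                hi = mid
        return lo

    first, last = bound(False), bound(True)
    # Map each matching text position back to the protein it falls in.
    return sorted({ids[bisect.bisect_right(starts, p) - 1] for p in sa[first:last]})


index = build_index([("P1", "MKTAYIAK"), ("P2", "GGTAYLLK")])
print(proteins_containing(index, "TAY"))
```

Because any substring can be looked up this way, the query peptide does not need to start or end at a trypsin cleavage site, which is the property the abstract relies on for semi-tryptic and non-tryptic peptides.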

8

DIGEST: An online tool for designing of multiple reaction monitoring assays

Gautam, P.; Singh, P.; Mrinal; Bhaskar, A.; Sacher, S.; Dagar, Y.; Basak, T.; Sengupta, S.; Ray, A.

2023-11-28 bioinformatics 10.1101/2023.11.27.568790 medRxiv

Top 0.1%

73.7%

Targeted proteomics using multiple reaction monitoring (MRM) assays enables fast and sensitive detection of a preselected set of target peptides. This technique utilizes the specificity of precursor-to-product transitions for quantitative analysis of multiple proteins in a single sample. The success of an MRM experiment depends on the selection of transitions; however, given the existing resources, accurately predicting the signal intensity of peptides and their fragmentation patterns ab initio is a challenging task. We present an alternative for rapid design of MRM transitions for proteomics research: DIGEST. Our method predicts the b and y ions with +1 and +2 charge produced in a collision cell of a mass spectrometer from peptides of multiple proteotypically digested proteins. Additionally, by using the existing knowledge of the fundamental rules for designing transitions, the tool provides optimal MRM transitions, negating the need to undertake prior "discovery" MS studies. We demonstrate that our algorithm is directed toward the selection of MRM precursor and product-ion pairs, and can avoid the pitfalls of interference due to cross-contamination of samples by selecting ion combinations that uniquely map to target peptides. Comparison with SRMAtlas showed that DIGEST successfully predicted the peptide and product-ion pairs in the majority of cases. We believe that DIGEST will facilitate rapid design of MRM assays with increased specificity, reducing the overall time required to design an MRM assay for routine mass spectrometry. DIGEST is available as a web-based tool at https://digest.raylab.iiitd.edu.in/
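
The b- and y-ion prediction mentioned above can be illustrated with the standard monoisotopic mass arithmetic. This is a minimal sketch and not DIGEST's code: the function name and the example peptide are assumptions, and it covers only unmodified peptides, with none of the transition-ranking rules the abstract refers to.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton
WATER = 18.010565   # mass of H2O, carried by every y ion


def fragment_mz(peptide, charges=(1, 2)):
    """Return {(ion_type, index, charge): m/z} for the b and y series."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = {}
    for i in range(1, len(peptide)):
        b_neutral = sum(masses[:i])            # first i residues from the N-terminus
        y_neutral = sum(masses[-i:]) + WATER   # last i residues plus water
        for z in charges:
            ions[("b", i, z)] = (b_neutral + z * PROTON) / z
            ions[("y", i, z)] = (y_neutral + z * PROTON) / z
    return ions


for key, mz in sorted(fragment_mz("PEPTIDEK").items()):
    print(key, round(mz, 4))
```

A transition list would pair each precursor m/z with a few of these fragment m/z values; which fragments to keep is a design decision the sketch leaves open.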

9

In silico approach to accelerate the development of mass spectrometry-based proteomics methods for detection of viral proteins: Application to COVID-19

Jenkins, C.; Orsburn, B.

2020-03-10 biochemistry 10.1101/2020.03.08.980383 medRxiv

Top 0.1%

72.2%

We describe a method for rapid in silico selection of diagnostic peptides from newly described viral pathogens and applied this approach to SARS-CoV-2/COVID-19. This approach is multi-tiered, beginning with compiling the theoretical protein sequences from genomic derived data. In the case of SARS-CoV-2 we begin with 496 peptides that would be produced by proteolytic digestion of the viral proteins. To eliminate peptides that would cause cross-reactivity and false positives we remove peptides from consideration that have sequence homology or similar chemical characteristics using a progressively larger database of background peptides. Using this pipeline, we can remove 47 peptides from consideration as diagnostic due to the presence of peptides derived from the human proteome. To address the complexity of the human microbiome, we describe a method to create a database of all proteins of relevant abundance in the saliva microbiome. By utilizing a protein-based approach to the microbiome we can more accurately identify peptides that will be problematic in COVID-19 studies which removes 12 peptides from consideration. To identify diagnostic peptides, another 7 peptides are flagged for removal following comparison to the proteome backgrounds of viral and bacterial pathogens of similar clinical presentation. By aligning the protein sequences of SARS-CoV-2 field isolates deposited to date we can identify peptides for removal due to their presence in highly variable regions that may lead to false negatives as the pathogen evolves. We provide maps of these regions and highlight 3 peptides that should be avoided as potential diagnostic or vaccine targets. Finally, we leverage publicly deposited proteomics data from human cells infected with SARS-CoV-2, as well as a second study with the closely related MERS-CoV to identify the two proteins of highest abundance in human infections. 
The resulting final list contains the 24 peptides most unique and diagnostic of SARS-CoV-2 infections. These peptides represent the best targets for the development of antibodies or clinical diagnostics. To demonstrate one application of this, we model peptide fragmentation using a deep learning tool to rapidly generate targeted LCMS assays and a data processing method for detecting COVID-19 infected patient samples.
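
The tiered filtering described in this abstract can be sketched as an in silico digest followed by successive subtraction of background peptides. The sketch below is a simplification under stated assumptions: it uses the common trypsin rule (cleave after K or R unless the next residue is P), exact sequence matching with I and L treated as equivalent, and hypothetical toy sequences. The abstract's actual criteria (sequence homology and similar chemical characteristics) are broader and are not reproduced here.

```python
def tryptic_digest(protein, min_len=6, max_len=30):
    """Fully tryptic peptides, no missed cleavages (assumed length limits)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        # Cleave after K/R unless followed by P; always close the final peptide.
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            pep = protein[start:i + 1]
            if min_len <= len(pep) <= max_len:
                peptides.append(pep)
            start = i + 1
    return peptides


def il_key(peptide):
    """Treat isoleucine and leucine as identical, since they are isobaric."""
    return peptide.replace("I", "L")


def filter_candidates(target_proteins, background_tiers):
    """Remove candidate peptides found in each background tier, in order.

    background_tiers: list of (tier_name, list_of_protein_sequences).
    Returns (remaining_candidates, {tier_name: removed_peptides}).
    """
    candidates = {p for prot in target_proteins for p in tryptic_digest(prot)}
    removed = {}
    for tier_name, proteins in background_tiers:
        seen = {il_key(p) for prot in proteins for p in tryptic_digest(prot)}
        hits = {p for p in candidates if il_key(p) in seen}
        removed[tier_name] = hits
        candidates -= hits
    return candidates, removed


# Hypothetical toy sequences, not real viral or human proteins.
remaining, removed = filter_candidates(
    ["AAAAAAKLLLLLLRGGGGGGK"],
    [("human", ["KAAAAAAK"]), ("microbiome", ["RIIIIIIR"])],
)
print(sorted(remaining), removed)
```

Recording what each tier removes mirrors the per-tier counts reported in the abstract and makes it easy to see which background is responsible for discarding a given peptide.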

10

Peptide-to-protein data aggregation using Fisher's method improves target identification in chemical proteomics

Lyu, H.; Gharibi, H.; Meng, Z.; Sokolova, B.; Zhang, X.; Zubarev, R.

2026-02-04 bioinformatics 10.64898/2026.02.02.702201 medRxiv

Top 0.1%

71.0%

Protein-level statistical tests in proteomics aimed at obtaining a p-value are conventionally made on protein abundances aggregated from peptide data. This integral approach overlooks peptide-level heterogeneity and ignores important information coded in individual peptide data, while a protein p-value can also be obtained by Fisher's method of combining peptide p-values using chi-square statistics. Here we test this latter approach across diverse chemical proteomics datasets based on assessments of protein expression, solubility and protease accessibility. Using the top four peptides ranked by their p-values consistently outperformed protein-level analysis and avoided biases introduced by inclusion of deviant peptides or imputation of missing peptide values. Fisher's method provides a simple and robust strategy, improving identification of regulated/shifted proteins in diverse proteomics assays.
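
As a worked illustration of the aggregation step, Fisher's method combines k p-values into the statistic X = -2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom when the p-values are independent and uniform under the null. The sketch below keeps the smallest four peptide p-values, as the abstract describes. The function names are assumptions, and how the authors compute peptide p-values or handle ties is not stated in the abstract.

```python
import math


def chi2_sf_even_df(x, k):
    """Survival function of a chi-square variable with 2*k degrees of freedom.

    For even degrees of freedom there is a closed form:
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    half = x / 2.0
    term = total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_protein_p(peptide_pvalues, top_n=4):
    """Combine the top_n smallest peptide p-values with Fisher's method."""
    best = sorted(peptide_pvalues)[:top_n]
    if not best or any(p <= 0.0 or p > 1.0 for p in best):
        raise ValueError("need at least one p-value in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in best)
    return chi2_sf_even_df(statistic, len(best))


print(fisher_protein_p([0.04, 0.20, 0.01, 0.65, 0.03]))
```

One caveat worth keeping in mind: choosing the smallest p-values before combining them departs from the assumptions under which the chi-square reference distribution holds, so the output of this sketch is best read as a ranking score unless it is calibrated separately.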

11

Improving power while controlling the false discovery rate when only a subset of peptides are relevant

Lin, A.; Plubell, D. L.; Keich, U.; Noble, W. S.

2020-10-21 bioinformatics 10.1101/2020.10.20.347278 medRxiv

Top 0.1%

70.1%

The standard proteomics database search strategy involves searching spectra against a peptide database and estimating the false discovery rate (FDR) of the resulting set of peptide-spectrum matches. One assumption of this protocol is that all the peptides in the database are relevant to the hypothesis being investigated. However, in settings where researchers are interested in a subset of peptides, alternative search and FDR control strategies are needed. Recently, two methods were proposed to address this problem: subset-search and all-sub. We show that both methods fail to control the FDR. For subset-search, this failure is due to the presence of "neighbor" peptides, which are defined as irrelevant peptides with a similar precursor mass and fragmentation spectrum as a relevant peptide. Not considering neighbors compromises the FDR estimate because a spectrum generated by an irrelevant peptide can incorrectly match well to a relevant peptide. Therefore, we have developed a new method, "filter then subset-neighbor search" (FSNS), that accounts for neighbor peptides. We show evidence that FSNS properly controls the FDR when neighbors are present and that FSNS outperforms group-FDR, the only other method able to control the FDR relative to a subset of relevant peptides.
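
For readers unfamiliar with the baseline this abstract starts from, the following is a minimal sketch of generic target-decoy FDR estimation on a ranked list of peptide-spectrum matches. It is not FSNS, subset-search or all-sub, and it does no neighbor filtering. The `(score, is_decoy)` input format, the function name, and the `(decoys + 1) / targets` estimate are assumptions chosen for illustration.

```python
def accept_at_fdr(psms, alpha=0.01):
    """Return scores of target PSMs accepted at estimated FDR <= alpha.

    psms: iterable of (score, is_decoy) pairs, higher score = better match.
    The FDR among the top-n PSMs is estimated as (decoys + 1) / targets.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    cutoff = 0  # how many top-ranked PSMs fall inside the accepted set
    for n, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep the deepest rank at which the estimate is still within alpha.
        if targets and (decoys + 1) / targets <= alpha:
            cutoff = n
    return [score for score, is_decoy in ranked[:cutoff] if not is_decoy]


# Toy data: ten strong targets, one decoy, then five weaker targets.
toy = [(20 - i, False) for i in range(10)] + [(10, True)] + [(9 - i, False) for i in range(5)]
print(len(accept_at_fdr(toy, alpha=0.1)))
```

The abstract's point is that this estimate assumes every database peptide is relevant; restricting attention to a subset after or before the search changes what the decoys are estimating, which is the failure mode FSNS is designed to address.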

12

Cloud-based DIA data analysis module for signal refinement improves accuracy and throughput of large datasets.

Christianson, K. E.; Jaffe, J. D.; Carr, S. A.; Vaca Jacome, A. S.

2021-07-14 bioinformatics 10.1101/2021.07.14.452243 medRxiv

Top 0.1%

69.5%

Data-independent acquisition (DIA) is a powerful mass spectrometry method that promises higher coverage, reproducibility, and throughput than traditional quantitative proteomics approaches. However, the complexity of DIA data caused by fragmentation of co-isolating peptides presents significant challenges for confident assignment of identity and quantity, information that is essential for deriving meaningful biological insight from the data. To overcome this problem, we previously developed Avant-garde (AvG), a tool for automated signal refinement of DIA and other targeted mass spectrometry data. AvG is designed to work alongside existing tools for peptide detection to address the reliability and quantitative suitability of signals extracted for the identified peptides. While its use is straightforward and offers efficient refinement for small datasets, the execution of AvG for large DIA datasets is time-consuming, especially if run with limited computational resources. To overcome these limitations, we present here an improved, cloud-based implementation of the AvG algorithm deployed on Terra, a user-friendly cloud-based platform for large-scale data analysis and sharing, as an accessible and standardized resource to the wider community.

13

Integrated View of Baseline Protein Expression in Human Tissues using public Data Independent Acquisition datasets

Prakash, A.; Collins, A.; Vilmovsky, L.; Fexova, S.; Vizcaino, J. A.; Jones, A. R.

2024-09-19 bioinformatics 10.1101/2024.09.16.613191 medRxiv

Top 0.1%

68.9%

The PRIDE database is the largest public data repository of mass spectrometry-based proteomics data and currently stores more than 40,000 datasets covering a wide range of organisms, experimental techniques and biological conditions. During the past few years, PRIDE has seen a significant increase in the number of submitted Data-Independent Acquisition (DIA) proteomics datasets. This provides an excellent opportunity for large scale data reanalysis and reuse. We have reanalysed 15 public label-free DIA datasets across various healthy human tissues, to provide a state-of-the-art view of the human proteome in baseline conditions (without any perturbations). We computed baseline protein abundances and compared them across various tissues, samples and datasets. Our second aim was to compare protein abundances obtained here with the results of previous analyses using human baseline Data-Dependent Acquisition (DDA) datasets. We observed a good correlation across some tissues, especially in liver and colon, but weak correlations were found in others, such as lung and pancreas. The reanalysed results including protein abundance values and curated metadata are made available to view and download from the resource Expression Atlas.

14

Sensitive and specific spectral library searching with COSS and Percolator

Shiferaw, G. A.; Gabriels, R.; Vandermarliere, E.; Martens, L.; Volders, P.-J.

2021-04-11 bioinformatics 10.1101/2021.04.09.438700 medRxiv

Top 0.1%

66.7%

Maintaining high sensitivity while limiting false positives is a key challenge in peptide identification from mass spectrometry data. Here, we therefore investigate the effects of integrating the machine learning-based post-processor Percolator into our spectral library searching tool COSS. To evaluate the effects of this post-processing, we have used forty data sets from two different projects and have searched these against the NIST and MassIVE spectral libraries. The searching is carried out using two spectral library search tools, COSS and MSPepSearch with and without Percolator post-processing, and using sequence database search engine MS-GF+ as a baseline comparator. The addition of the Percolator rescoring step to COSS is effective and results in a substantial improvement in sensitivity and specificity of the identifications. COSS is freely available as open source under the permissive Apache2 license, and binaries and source code are found at https://github.com/compomics/COSS

15

Evaluation of Parallel Accumulation-Serial Fragmentation methods for metaproteomics using a model microbiome

Shrestha, R.; Rajczewski, A. T.; Do, K.; Willetts, M.; Kleiner, M.; Griffin, T.; Jagtap, P. D.

2025-08-15 biochemistry 10.1101/2025.08.13.670166 medRxiv

Top 0.1%

66.2%

Mass spectrometry-based metaproteomics allows for the identification and quantification of thousands of proteins from clinical and environmental samples and is rapidly gaining importance in microbiome sciences. Metaproteomics researchers can measure taxonomic and functional abundances of microbiomes, shedding light on mechanistic details of microbiome interactions with their environment. However, metaproteomic analysis suffers from limited depth of coverage due to the presence of millions of peptides at lower abundance levels. Recent advances in data-independent acquisition mass spectrometry coupled with Parallel Accumulation-Serial Fragmentation (PASEF) technology offer improved depth of coverage. PASEF technology enables simultaneous accumulation of ions from multiple co-eluting peptides by combining ion mobility separation with dynamic quadrupole isolation, allowing efficient and selective fragmentation in a single scan. This boosts ion sampling efficiency and resolves overlapping signals with high sensitivity. In this study, we assessed proteome coverage, quantitative precision, and accuracy of Data-dependent acquisition (DDA) and Data-independent acquisition (DIA) methods coupled with the PASEF method. For this, we used a ground-truth mock community containing 28 species (30 strains) from all three domains of life and bacteriophages with a 400-fold dynamic range of organism abundance. Our results showed that diaPASEF demonstrated superior performance, identifying 168% more peptide precursors, 155% more peptides, and 66% more protein groups compared to ddaPASEF. Quantitative measurements showed improved precision with diaPASEF, with 26 out of 28 organisms exhibiting coefficient of variation values below 20%, compared to 24 organisms with ddaPASEF. Both ddaPASEF and diaPASEF methods accurately quantified the 22 most abundant organisms, while measurements of low-abundance bacteriophages showed significant deviation from expected values. 
Our findings demonstrate that diaPASEF provides enhanced depth of coverage and quantitative reliability for metaproteomics analysis, particularly beneficial for clinical and environmental microbiome studies where deeper functional characterization is essential. This study provides valuable benchmark data to facilitate the development of advanced bioinformatic methods for quantitative metaproteomics.

16

A flexible workflow for building spectral libraries from narrow window data independent acquisition mass spectrometry data

Heil, L. R.; Fondrie, W. E.; McGann, C. D.; Federation, A. J.; Noble, W. S.; MacCoss, M. J.; Keich, U.

2021-11-22 bioinformatics 10.1101/2021.11.22.469568 medRxiv

Top 0.1%

65.3%

Advances in library-based methods for peptide detection from data independent acquisition (DIA) mass spectrometry have made it possible to detect and quantify tens of thousands of peptides in a single mass spectrometry run. However, many of these methods rely on a comprehensive, high quality spectral library containing information about the expected retention time and fragmentation patterns of peptides in the sample. Empirical spectral libraries are often generated through data-dependent acquisition and may suffer from biases as a result. Spectral libraries can be generated in silico but these models are not trained to handle all possible post-translational modifications. Here, we propose a false discovery rate controlled spectrum-centric search workflow to generate spectral libraries directly from gas-phase fractionated DIA tandem mass spectrometry data. We demonstrate that this strategy is able to detect phosphorylated peptides and can be used to generate a spectral library for accurate peptide detection and quantitation in wide window DIA data. We compare the results of this search workflow to other library-free approaches and demonstrate that our search is competitive in terms of accuracy and sensitivity. These results demonstrate that the proposed workflow has the capacity to generate spectral libraries while avoiding the limitations of other methods.

17

PepGM: A probabilistic graphical model for taxonomic inference of viral proteome samples with associated confidence scores

Holstein, T.; Kistner, F.; Martens, L.; Muth, T.

2022-09-21 bioinformatics 10.1101/2022.09.21.508832 medRxiv

Top 0.1%

65.1%

Motivation: Inferring taxonomy in mass spectrometry-based shotgun proteomics is a complex task. In multi-species or viral samples of unknown taxonomic origin, the presence of proteins and corresponding taxa must be inferred from a list of identified peptides, which is often complicated by protein homology: many proteins do not only share peptides within a taxon but also between taxa. However, correct taxonomic identification is crucial when identifying different viral strains with high sequence homology, considering, e.g., the different epidemiological characteristics of the various strains of SARS-CoV-2. Additionally, many viruses mutate frequently, further complicating the correct assignment of virus proteomic samples. Results: We present PepGM, a probabilistic graphical model for the taxonomic assignment of virus proteomic samples with strain-level resolution and associated confidence scores. PepGM combines the results of a standard proteomic database search algorithm with belief propagation to calculate the marginal distributions, and thus confidence scores, for potential taxonomic assignments. We demonstrate the performance of PepGM using several publicly available virus proteomic datasets, showing its strain-level resolution performance. In two out of eight cases, the taxonomic assignments were only correct on species level, which PepGM clearly indicates by lower confidence scores. Availability and Implementation: PepGM is written in Python and embedded into a Snakemake workflow. It is available at https://github.com/BAMeScience/PepGM

18

Benchmarking PSM identification tools for single cell proteomics

Van Der Watt, D.; Boekweg, H.; Truong, T.; Guise, A. J.; Plowey, E. D.; Kelly, R. T.; Payne, S. H.

2021-08-18 bioinformatics 10.1101/2021.08.17.456676 medRxiv

Top 0.1%

64.1%

Single cell proteomics is an emerging sub-field within proteomics with the potential to revolutionize our understanding of cellular heterogeneity and interactions. Recent efforts have largely focused on technological advancements in sample preparation, chromatography and instrumentation to enable measuring proteins present in these ultra-limited samples. Although advancements in data acquisition have rapidly improved our ability to analyze single cells, the software pipelines used in data analysis were originally written for traditional bulk samples and their performance on single cell data has not been investigated. We benchmarked five popular peptide identification tools on single cell proteomics data. We found that MetaMorpheus achieved the greatest number of peptide spectrum matches at a 1% false discovery rate. Depending on the tool, we also find that post processing machine learning can improve spectrum identification results by up to ~40%. Although rescoring leads to a greater number of peptide spectrum matches, these new results typically are generated by 3rd party tools and have no way of being utilized by the primary pipeline for quantification. Exploration of novel metrics for machine learning algorithms will continue to improve performance.

19

Beyond Delta Masses: MS Andrea Directly Resolves Combinatorial Peptide Modifications in Open Searches

Buur, L. M.; Winkler, S.; Dorfer, V.

2026-03-31 molecular biology 10.64898/2026.03.27.714851 medRxiv

Top 0.1%

62.9%

Open modification search (OMS) strategies have gained popularity in mass spectrometry-based proteomics for identification of peptides carrying unknown or unexpected post-translational modifications. However, most OMS search engines report only the overall mass difference between the precursor and the matched peptide and do not explicitly identify or score combinations of multiple modifications at the peptide-spectrum match (PSM) level, leaving the interpretation of mass shifts to the end user and to downstream analysis tools. Here, we introduce MS Andrea, a novel OMS search engine developed to directly identify and score combinations of up to four variable modifications per peptide without having to predefine them. MS Andrea uses a sequence tag-based strategy to efficiently filter candidate peptides prior to scoring. Remaining candidates are evaluated using the MS Amanda scoring function, first considering fixed modifications only, followed by a second scoring stage in which combinations of modifications from the Unimod database are considered based on the observed mass difference and matched to the spectrum. We evaluated MS Andrea using phosphopeptide datasets from HeLa cells and Arabidopsis thaliana and compared its performance with the widely used OMS engines MSFragger and Sage. Across datasets, MS Andrea identified the highest number of PSMs at 1% false discovery rate while achieving comparable peptide-level identifications. Importantly, MS Andrea directly reports modification identities and sites at the PSM level and enables the identification of peptides having up to four variable modifications. Together, these results demonstrate that MS Andrea facilitates more detailed and interpretable characterization of peptide modifications while maintaining competitive identification performance in OMS-based proteomic analyses.
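The idea of explaining an observed mass difference by a combination of modifications can be illustrated with a minimal sketch. This is not MS Andrea's implementation: the function name `explain_delta`, the five-entry modification table, and the 0.01 Da tolerance are assumptions chosen for illustration, and the sketch covers only the mass-matching step, not spectrum matching or scoring.

```python
from itertools import combinations_with_replacement

# Monoisotopic mass shifts in Da for a few common modifications.
# A tiny illustrative subset; a real search would draw on the full Unimod database.
MODS = {
    "Phospho": 79.966331,
    "Oxidation": 15.994915,
    "Acetyl": 42.010565,
    "Methyl": 14.015650,
    "Carbamidomethyl": 57.021464,
}


def explain_delta(delta_mass, mods=MODS, max_mods=4, tol=0.01):
    """Return (combination, summed mass) pairs within tol Da of delta_mass.

    Combinations contain 1..max_mods modifications and may repeat a
    modification (e.g. two phosphorylations on one peptide).
    """
    hits = []
    for k in range(1, max_mods + 1):
        for combo in combinations_with_replacement(sorted(mods), k):
            total = sum(mods[name] for name in combo)
            if abs(total - delta_mass) <= tol:
                hits.append((combo, total))
    return hits


if __name__ == "__main__":
    # Delta mass equal to one phosphorylation plus one oxidation.
    for combo, total in explain_delta(95.961246):
        print(combo, round(total, 6))
```

With this small table, a mass difference of 95.961246 Da is explained only by one oxidation plus one phosphorylation. With a full modification database, many combinations can fall inside the tolerance window, which is why candidates also need to be matched against the spectrum.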

20

Pho-Tip: one-pot dephosphorylation for rapid and sensitive analysis of DIA phosphoproteomics data

Faisst, K. D.; Lau, K.; Sinn, L. R.; Szyrwiel, L.; Demichev, V.

2025-11-18 biochemistry 10.1101/2025.11.17.687597 medRxiv

Top 0.1%

62.6%

Recent advances in instrumentation and data processing have transformed data-independent acquisition (DIA) proteomics into a reliable technology for quantitative profiling of post-translational modifications. However, analysis of DIA phosphoproteomics data is challenging due to the large search space, wherein all combinations of phosphosites on a peptide need to be considered. Current approaches therefore face significant hurdles in detecting low-abundant phosphorylated peptides, in particular when working with low sample amounts. Here we introduce Pho-Tip, a lossless one-pot dephosphorylation strategy. We show that Pho-Tip enables comprehensive mapping of phosphorylated peptide sequences, facilitating streamlined creation of experiment-focused in silico predicted spectral libraries and thus rapid and sensitive analysis of DIA phosphoproteomics experiments.
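To give a feel for the search-space growth mentioned above, the following sketch counts the phosphorylated forms of a single peptide sequence. It assumes that every S, T, and Y residue is a candidate site; the function name `count_phosphoforms` and the example peptide are hypothetical and are not taken from Pho-Tip.

```python
from math import comb


def count_phosphoforms(peptide, max_sites=None):
    """Count distinct phosphosite combinations on a peptide.

    Every S, T, or Y residue is treated as a candidate site. The
    unmodified form is not counted. max_sites optionally caps the number
    of phosphate groups considered per peptide.
    """
    n = sum(peptide.count(residue) for residue in "STY")
    limit = n if max_sites is None else min(n, max_sites)
    return sum(comb(n, k) for k in range(1, limit + 1))


if __name__ == "__main__":
    print(count_phosphoforms("SAMPLETYK"))               # 3 candidate sites
    print(count_phosphoforms("SAMPLETYK", max_sites=2))  # at most 2 phosphates
```

A peptide with three candidate sites already has seven phosphorylated forms, and the count roughly doubles with each additional site, whereas the dephosphorylated sequence corresponds to a single entry.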